Ames Housing Data and Kaggle Challenge

Problem Statement

Housing is an important investment tool in the toolkit of any investors. However, with a multiplicity of factors when choosing a house, it is often difficult to pinpoint the most important factors influencing sale price of houses.

To help investors and homeowners of Ames pick the best house and the most worthwhile upgrades in order to optmise sale price, we will build a predictor which predicts the sale price base on various aspects of a house, from floor area to roofing type, before identifying some of the most important factors which helps owners maximise the sale value of their house.

Role taken: Advisor to potential property investors and homeowners

Background

Ames is a city located in in Story County, Iowa, USA. With a population of 66,427 in 2020, it is only the 6th largest city in Iowa, and considered a small size city. Ames has a booming economy, in 2015, it was ranked one of the top 15 "Cities That Have Done the Best Since the Recession" by Bloomberg Businessweek (Source), it also has one of the lowest unemployment rates in the united states.

Data Preparation and Cleaning

Libraries and Settings

Useful Function

Data import

Data First Look

Data Dictionary

The data dictionary consists of Feature Label, Description and Categories.

Also, for the purpose of our analysis, we have added Aspect of house which is the particular component/aspect of the house which the feature belongs to, and the Data Type which tells us if the feature is nominal, ordinal or numeric

Feature Label Description Categories Aspect of house Data Type
Id Unique ID None Administrative None
PID Parcel Identification Number assigned to each property within the Ames None Unrelated None
MS SubClass The building class 20: 1-STORY 1946 & NEWER ALL STYLES / 30: 1-STORY 1945 & OLDER / 40: 1-STORY W/FINISHED ATTIC ALL AGES / 45: 1-1/2 STORY - UNFINISHED ALL AGES / 50: 1-1/2 STORY FINISHED ALL AGES / 60: 2-STORY 1946 & NEWER / 70: 2-STORY 1945 & OLDER / 75: 2-1/2 STORY ALL AGES / 80: SPLIT OR MULTI-LEVEL / 85: SPLIT FOYER / 90: DUPLEX - ALL STYLES AND AGES / 120: 1-STORY PUD (Planned Unit Development) - 1946 & NEWER / 150: 1-1/2 STORY PUD - ALL AGES / 160: 2-STORY PUD - 1946 & NEWER / 180: PUD - MULTILEVEL - INCL SPLIT LEV/FOYER / 190: 2 FAMILY CONVERSION - ALL STYLES AND AGES Type Nominal
MS Zoning Identifies the general zoning classification of the sale A: Agriculture / C: Commercial / FV: Floating Village Residential / I: Industrial / RH: Residential High Density / RL: Residential Low Density / RP: Residential Low Density Park / RM: Residential Medium Density Administrative Nominal
Lot Frontage Linear feet of street connected to property None Environment Numeric
Lot Area Lot size in square feet None Land Numeric
Street Type of road access to property Grvl: Gravel / Pave: Paved Environment Nominal
Alley Type of alley access to property Grvl: Gravel / Pave: Paved / NA: No alley access Environment Nominal
Lot Shape General shape of property Reg: Regular / IR1: Slightly irregular / IR2: Moderately Irregular / IR3: Irregular Land Nominal
Land Contour Flatness of the property Lvl: Near Flat/Level / Bnk: Banked - Quick and significant rise from street grade to building / HLS: Hillside - Significant slope from side to side / Low Depression Land Nominal
Utilities Type of utilities available AllPub: All public Utilities (E,G,W,& S) / NoSewr: Electricity, Gas, and Water (Septic Tank) / NoSeWa: Electricity and Gas Only / ELO: Electricity only Utility Nominal
Lot Config Lot configuration Inside: Inside lot / Corner: Corner lot / CulDSac: Cul-de-sac / FR2: Frontage on 2 sides of property / FR3: Frontage on 3 sides of property Environment Nominal
Land Slope Slope of property Gtl: Gentle slope / Mod: Moderate Slope / Sev: Severe Slope Land Nominal
Neighborhood Physical locations within Ames city limit Value represents unique neighbourhoods in Ames Environment Nominal
Condition 1 Proximity to main road or railroad Artery: Adjacent to arterial street / Feedr: Adjacent to feeder street / Norm: Normal / RRNn: Within 200' of North-South Railroad / RRAn: Adjacent to North-South Railroad / PosN: Near positive off-site feature--park, greenbelt, etc. / PosA: Adjacent to postive off-site feature / RRNe: Within 200' of East-West Railroad / RRAe: Adjacent to East-West Railroad Environment Nominal
Condition 2 Proximity to main road or railroad (if a second is present) Same as Condition 1 Environment Nominal
Bldg Type Type of dwelling 1Fam: Single-family Detached / 2FmCon: Two-family Conversion; originally built as one-family dwelling / Duplx: Duplex / TwnhsE: Townhouse End Unit / TwnhsI: Townhouse Inside Unit Type Nominal
House Style Style of dwelling 1Story: One story / 1.5Fin: One and one-half story: 2nd level finished / 1.5Unf: One and one-half story: 2nd level unfinished / 2Story: Two story / 2.5Fin: Two and one-half story: 2nd level finished / 2.5Unf: Two and one-half story: 2nd level unfinished / SFoyer: Split Foyer / SLvl: Split Level Overall Nominal
Overall Qual Overall material and finish quality 10 Very Excellent / 9 Excellent / 8 Very Good / 7 Good / 6 Above Average / 5 Average / 4 Below Average / 3 Fair / 2 Poor / 1 Very Poor Overall Ordinal
Overall Cond Overall condition rating Same as Overall Qual Overall Ordinal
Year Built Original construction date None Age Numeric
Year Remod/Add Remodel date (same as construction date if no remodeling or additions) None Age Numeric
Roof Style Type of roof Flat: Flat / Gable: Gable / Gambrel: Gabrel (Barn) / Hip: Hip / Mansard: Mansard / Shed: Shed Roof Nominal
Roof Matl Roof material ClyTile: Clay or Tile / CompShg: Standard (Composite) Shingle / Membran: Membrane / Metal: Metal / Roll: Roll / Tar&Grv: Gravel & Tar / WdShake: Wood Shakes / WdShngl: Wood Shingles Roof Nominal
Exterior 1st Exterior covering on house AsbShng: Asbestos Shingles / AsphShn: Asphalt Shingles / BrkComm: Brick Common / BrkFace: Brick Face / CBlock: Cinder Block / CemntBd: Cement Board / HdBoard: Hard Board / ImStucc: Imitation Stucco / MetalSd: Metal Siding / Other: Other / Plywood:Plywood / PreCast: PreCast / Stone: Stone / Stucco: Stucco / VinylSd: Vinyl Siding / Wd Sdng: Wood Siding / WdShing: Wood Shingles Facade Nominal
Exterior 2nd Exterior covering on house (if more than one material) Same as Exterior 1st Facade Nominal
Mas Vnr Type Masonry veneer type BrkCmn: Brick Common / BrkFace: Brick Face / Cblock: Cinder Block / None: None / Stone: Stone Facade Nominal
Mas Vnr Area Masonry veneer area in square feet None Facade Numeric
Exter Qual Exterior material quality Ex: Excellent / Gd: Good / TA: Average/Typical / Fa: Fair / Po: Poor Facade Ordinal
Exter Cond Present condition of the material on the exterior Same as Exter Qual Facade Ordinal
Foundation Type of foundation BrkTil: Brick & Tile / Cblock: Cinder Block / Pconc: Poured Contrete / Slab: Slab / Stone: Stone / Wood: Wood Foundation Nominal
Bsmt Qual Height of the basement Ex: Excellent (100+ inches) / Gd: Good (90-99 inches) / TA: Typical (80-89 inches) / Fa: Fair (70-79 inches) / Po: Poor (<70 inches) / NA: No Basement Basement Ordinal
Bsmt Cond General condition of the basement Ex: Excellent / Gd: Good / TA: Typical - slight dampness allowed / Fa: Fair - dampness or some cracking or settling / Po: Poor - Severe cracking, settling, or wetness / NA: No Basement Basement Ordinal
Bsmt Exposure Walkout or garden level basement walls Gd: Good Exposure / Av: Average Exposure (split levels or foyers typically score average or above) / Mn: Mimimum Exposure / No: No Exposure / NA: No Basement Basement Ordinal
BsmtFin Type 1 Quality of basement finished area GLQ: Good Living Quarters / ALQ: Average Living Quarters / BLQ: Below Average Living Quarters / Rec: Average Rec Room / LwQ: Low Quality / Unf: Unfinshed / NA: No Basement Basement Ordinal
BsmtFin SF 1 Type 1 finished square feet None Basement Numeric
BsmtFin Type 2 Quality of second finished area (if present) Same as BsmtFin Type 1 Basement Ordinal
BsmtFin SF 2 Type 2 finished square feet None Basement Numeric
Bsmt Unf SF Unfinished square feet of basement area None Basement Numeric
Total Bsmt SF Unfinished square feet of basement area None Basement Numeric
Heating Type of heating Floor: Floor Furnace / GasA: Gas forced warm air furnace / GasW: Gas hot water or steam heat / Grav: Gravity furnace / OthW: Hot water or steam heat other than gas / Wall: Wall furnace Utility Nominal
Heating QC Heating quality and condition Ex: Excellent / Gd: Good / TA: Average/Typical / Fa: Fair / Po: Poor Utility Ordinal
Central Air Central air conditioning N: No / Y: Yes Utility Ordinal
Electrical Electrical system SBrkr: Standard Circuit Breakers & Romex / FuseA: Fuse Box over 60 AMP and all Romex wiring (Average) / FuseF: 60 AMP Fuse Box and mostly Romex wiring (Fair) / FuseP: 60 AMP Fuse Box and mostly knob & tube wiring (poor) / Mix: Mixed Utility Ordinal
1st Flr SF First Floor square feet None Above Grade Numeric
2nd Flr SF Second floor square feet None Second Floor Numeric
Low Qual Fin SF Low quality finished square feet (all floors) None Above Grade Numeric
Gr Liv Area Above grade (ground) living area square feet None Overall Numeric
Bsmt Full Bath Basement full bathrooms None Bathroom Numeric
Bsmt Half Bath Basement half bathrooms None Bathroom Numeric
Full Bath Full bathrooms above grade None Bathroom Numeric
Half Bath Half baths above grade None Bathroom Numeric
Bedroom AbvGr Number of bedrooms above basement level None Bedroom Numeric
Kitchen AbvGr Number of kitchens None Kitchen Numeric
Kitchen Qual Kitchen quality Ex:Excellent / Gd: Good / TA: Typical/Average / Fa: Fair / Po: Poor Kitchen Ordinal
TotRms AbvGrd Total rooms above grade (does not include bathrooms) None Above Grade Numeric
Functional Home functionality rating Typ: Typical Functionality / Min1: Minor Deductions 1 / Min2: Minor Deductions 2 / Mod: Moderate Deductions / Maj1: Major Deductions 1 / Maj2: Major Deductions 2 / Sev: Severely Damaged / Sal: Salvage only Overall Ordinal
Fireplaces Number of fireplaces None Utility Numeric
Fireplace Qu Fireplace quality Ex: Excellent - Exceptional Masonry Fireplace / Gd: Good - Masonry Fireplace in main level / TA: Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement / Fa: Fair - Prefabricated Fireplace in basement / Po: Poor - Ben Franklin Stove / NA: No Fireplace Utility Ordinal
Garage Type Garage location 2Types: More than one type of garage / Attchd: Attached to home / Basment: Basement Garage / BuiltIn: Built-In (Garage part of house - typically has room above garage) / CarPort: Car Port / Detchd: Detached from home / NA: No Garage Garage Nominal
Garage Yr Blt Year garage was built None Age Numeric
Garage Finish Interior finish of the garage Fin: Finished / RFn: Rough Finished / Unf: Unfinished / NA: No Garage Garage Ordinal
Garage Cars Size of garage in car capacity None Garage Numeric
Garage Area Size of garage in square feet None Garage Numeric
Garage Qual Garage quality Ex: Excellent / Gd: Good / TA: Typical/Average / Fa: Fair / Po: Poor / NA: No Garage Garage Ordinal
Garage Cond Garage condition Same as Garage Qual Garage Ordinal
Paved Drive Paved driveway Y: Paved / P: Partial Pavement / N: Dirt/Gravel Driveway Ordinal
Wood Deck SF Wood deck area in square feet None Porch/Garden Numeric
Open Porch SF Open porch area in square feet None Porch/Garden Numeric
Enclosed Porch Enclosed porch area in square feet None Porch/Garden Numeric
3Ssn Porch Three season porch area in square feet None Porch/Garden Numeric
Screen Porch Screen porch area in square feet None Porch/Garden Numeric
Pool Area Pool area in square feet None Porch/Garden Numeric
Pool QC Pool quality Ex: Excellent / Gd: Good / TA: Average/Typical / Fa: Fair / NA: No Pool Porch/Garden Numeric
Fence Fence quality GdPrv: Good Privacy / MnPrv: Minimum Privacy / GdWo: Good Wood / MnWw: Minimum Wood/Wire / NA: No Fence Porch/Garden Nominal
Misc Feature Miscellaneous feature not covered in other categories Elev: Elevator / Gar2: 2nd Garage (if not described in garage section) / Othr: Other / Shed: Shed (over 100 SF) / TenC :Tennis Court / NA: None Overall Nominal
Misc Val $Value of miscellaneous feature None Overall Numeric
Mo Sold Month Sold None Age Nominal
Yr Sold Year Sold None Age Numeric
Sale Type Type of sale WD: Warranty Deed - Conventional / CWD: Warranty Deed - Cash / VWD: Warranty Deed - VA Loan / New: Home just constructed and sold / COD: Court Officer Deed/Estate / Con: Contract 15% Down payment regular terms / ConLw: Contract Low Down payment and low interest / ConLI: Contract Low Interest / ConLD: Contract Low Down / Oth: Other Administrative Nominal
SalePrice The property's sale price in dollars. The response variable. None Price Numeric

Data Cleaning

Merge Train and Test Set

To ensure similar structure of train and test sets, we will merge them into a single dataframe so that it is easier to perform data cleaning and feature engineering steps.

Missing Data

Let us visualize the missing data

Missing due to absence

For some features, the missing value could result due to the absence of the said feature. Let us investigate if this is indeed the case.

Pool QC

For pool, we look to see if the Pool Area feature is 0, if it is, it indicates no pool.

There are 2042 values, same as the missing count of Pool QC. We will replace missing values with NA

Now we perform the same analysis for other missing values

Misc Feature

For Misc Feature, the data dictionary indicates that NA refers to No misc feature, so we fill the missing values with NA

Alley

For Alley, the data dictionary indicates that NA refers to No Alley Access, so we fill the missing values with NA

Fence

Same for Fence

Fireplace Qu
Garage

Other than Garage Type, other garage related features seem to have same number of missing values, these are probably houses with no garage.

We will first take a look at the 2 rows with missing values for other garage features but not Garage Type

Seems like these are Garage Type with Detached Garage.

For the row from training data, we will drop it.

However, we cannot drop the row from test dataframe, so we replace the missing values with

Now we tackle the missing values for garage features with no garage. Similar to previous case, we will fill missing values with relevant values while taking note to prevent leakage by restricting replacement to either statistics from train or test set

Basement

For basement features, there are 3 observations with missing values for Bsmt Exposure but not others, let us take a closer look.

Since there are only 3 rows, we will filter out the rows

For other basement features with missing values, there are missing values, it is because there are no basements, as seen from the 0 values for Total Bsmt SF . Let us fill the missing values with NA.

Masonry veneer

For Mas Vnr Type and Area, we will impute the missing values as "None" and 0.

Electrical

Since this is an observation from the test set, we cannot drop it. Hence, we will fill it with the most frequent occurance.

Lot frontage

Lot frontage refers to the length of the street directly bordering the property. It is often the case that properties located along the same street have similar lots and styles. Properties located close to each other will have have closer PID (Parcel ID). It is also often the case that Lot Area, lot shape and Lot Config have some influence on the frontage. So we will fit a KNN model using the above mentioned features to predict the missing values for Lot Frontage.

There seems to some clustering for the observations so we will procced with the predictios base on the logic.

First we pick out all PIDs of with missing Lot Frontage values so its is easier to merge later.

Then we get the features columns for the relevant features.

We see that the distibutions are almost identical before and after imputing with KNNRegressor.

We have now dealt with all missing values

Feature Engineering

Let us view the current dataframe

Encoding Ordinal Variables

Some features consists of values which indicate some form of order, these are mainly features that place some form of judgement on the condition/quality. There is also Central Air features with 2 values indicating the presence or absence of central air.

Changing Year to Age

Instead of using year Year Built and Year Remod/Add, we could get the Age Since Built and Age Since Remod. Let us first see if the year features has any anomalies.

We have now dealt with the issue with Garage Yr Blt. Let us see if there are remod year which are earlier than built year, which would not make sense as well.

There is one such case, we will change Year Built to 2001.

We will create the new feature columns for age now.

We shall do some visualizatio for the year and age features with the saleprice.

We see that Age features has a slighly better correlation value compared to Year features. They also have the advantage of being a derivation of Sold and Built years. Therefore, we will drop the Year Built and Year Remod/Add. However, we will keep the Yr Sold feature as they may account for yearly sale price fluctuations. Even from the visualization, we see that there was a price drop in 2008 which may have been the 2008 recession.

Combine Features

We could combine the number of bathrooms to produce a feature call Total Bath

We will also add a Gross Flr Area feature which is the sum of Total Bsmt SF and Gr Liv Area

Additional features to indicate presence of housing components

We will add some additional features which just contains boolean values to indicate the presence (1) or absence (0) of certain housing components like fireplace, pool, etc.

Feature Selection

Separating Feature Types

We will separate the features into Numeric, Ordinal, Categorical and Non-Useful (such as Id) features and use different methods for processing and selection.

Numeric Features

Features with weak correlation to response and a high percentage of a single value

We now take a look at the relationship between numeric features and look for possible multicollinearity and features with weak correlations to Sale Price.

Seems like some features has majority of the values of 0. We will see what percentage of each feature has 0 value.

We will pick out features with a weak correlation (-0.1 < r 0.1) and a high percentage of a single value (>90%).

Features to drop using different thresholds

We have generated 3 features sets (for features to drop) by setting different threshold values.

Features with high multicollinearity

We will use a correlation value of >0.8 or <-0.8 as threshold for high multicollinearity.

Looking at the scatter plots, we will add additional features to the list of features to be dropped.

Numeric features which are derived from other features

Some features such as basement areas, living areas and number of bathrooms, have a sum total features which are derived from other features, we fit them through a simple model and see if the features are significant.

From the p-values, we see that most of the features we tested are significant, with the exception of Bsmt Unf SF with a p-value of 11.7%, we will drop this feature.

Ordinal features

Next we will examine the ordinal features by looking at some boxplots of the features against SalePrice.

We can see from the boxplots that Pool QC is unlikely to have significant impact on SalePrice.

For others, we do see some form of difference but could not judge whether they are significant, we will create additional feature sets with different features to drop.

Categorical Features

We will perform the same type of analysis on the categorical features.

Again, we will add some features to our features to drop sets and create additional feature sets. We create 3 extra groups of feature sets.

  1. Categorical features we are quite sure has little impact - add to existing drop lists
  1. Categorical features that might have little impact, but we are not so sure - create new feature sets
  1. Categorical features that has too many different unique values and not much variation, which may result in overfitting - we create new features on top of the previous set of features to drop

Just an extra look at the neighborhoods.

Other features with a dominant value

We have examined and identified numeric variables with a weak correlation and a high percentage of dominant values, we will now identify ordinal and categorical variables with a high percentage of a single dominant value and create 2 more feature groups. Those with more than 80% of a dominant value and those with a 90% of dominant value.

Modelling

Pre-processing

Reducing Skewness

First we take a look at the histogram of each feature to see if we can reduce the skewness of some features and make them more resembling of a normal distribution. Linear regression algorithms generally handle data which are more normally distributed better.

We will try to reduce the skewnesss of some features which approximately follows a normal distribution. We use a box cox transformation on (x + 1) as it is able to handle zero values. We will write a function that iterates through a range of lambda values and picks the lambda which minimizes our skewness for each feature

We have significantly reduced the skewness of selected features

Dummy creation

We can now start the modelling process with the 13 different features sets (includes base set with all features). First, we obtain dataframes by dropping different numbers of features, then we create dummy variables for categorical variables and re-split our feature sets into train and test.

Now let us review the 55 feature sets.

Model Fitting and Evaluation

Round 1

Model Fitting
Model Evaluation

We see that out of all the estimators, the three estimators (ridge, lasso, elastic net) with regularization performed comparatively, with Lasso regression having a slight edge across all feature sets. Linear estimators were clearly not able to handle the number of features, while KNN regressor resulted in signicantly larger error as well.

The best feature set turn out to be feature_set_16.

Round 2

We have identified feature_set_16 as the feature set that produce the lowest mean RMSE score. Now, we will do further hyperparameter tuning by running the feature set through lasso regressor which seems to produce the best result across estimators, we will use a more detailed search space to pick out the better hyperparameters.

Plot Actual SalePrice vs Predicted SalePrice for Training Set
Plot Residual Value vs Predicted SalePrice for Training Set

We see the our points are rather evenly spread about around the line of perfect prediction. On the higher end of the SalePrice, our linear model does tend to 'underpredict' for some points, while for two points in particular, the model "overpredicted" by quite a large margin. This indicates that our model still has room to improve in terms of predicting properties with higher prices. That being said, most of our sale prices do not fall on the higher ranges, so our model does a pretty good job.

Model Interpretation

Now let us review our best feature set. It consists of 174(after dummify) features and was a result of dropping the following features.

We will now interprete the coefficients of our result from 2 rounds of modelling and identify the most important features influencing Sale Price.

Top 15 Most important features with positive impact on Sale Price

Top 15 Most important features with negative impact on Sale Price

Here, we can easily visualize the most important factors having an impact on sale price, grouped by positive and negative impact.

A sidenote on the neighborhoods

Perhaps a good guide on location is by coming up with a ranking of neighborhoods, by the potential impact on sale price.

Conclusions and Recommendations

We see that the most important features influencing sale price can be grouped in a few categories:

Features with positive influence on price.

Area

Quality

Neighborhoods

Proximity

Roof material

Features with negative influence on price.

House Age

Neighborhoods

Zoning

Heating Type

Number of Kitchens

Basement

Garage type

Utilities

Proxmity

Submission

Kaggle Score

Kaggle Score: Kaggle score